Automated Cleanup Processing for Extracting Bibliographic Data from Biomedical Online Journals

Authors

  • In Cheol Kim
  • Daniel X. Le
Abstract

An R&D division of the National Library of Medicine (NLM) has developed the Web-based Medical Article Records System (WebMARS) to create citations from online biomedical journals. This paper presents one important part of this system, the automated cleanup module that extracts bibliographic information from HTML-formatted text based on a rule-based approach. A learning scheme comparing the output of the cleanup module to the verified processing result is newly introduced to create and update cleanup rules automatically, thereby minimizing the manual effort for rule setting and improving the performance of the cleanup processing. Experimental results show that the proposed automated cleanup module can effectively detect and extract the bibliographic data of interest from HTML-formatted online journal articles using relevant rules identified through the learning process.
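
The abstract does not give the rule format or the learning procedure, so the following is only a minimal sketch of the general idea, under stated assumptions: a cleanup rule is taken to be a literal prefix and suffix to strip from a tag-free field, and "learning" is taken to mean comparing the cleanup output with an operator-verified value and recording whatever extra text surrounds it. The names `CleanupRule`, `strip_tags`, `apply_rules`, and `learn_rule` are hypothetical and are not taken from WebMARS.

```python
import html
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CleanupRule:
    """Hypothetical rule: a literal prefix and suffix to strip from a field."""
    prefix: str = ""
    suffix: str = ""


def strip_tags(text: str) -> str:
    """Replace HTML tags with spaces, decode entities, collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def apply_rules(raw: str, rules: List[CleanupRule]) -> str:
    """Strip tags, then apply each rule's prefix and suffix removal in order."""
    text = strip_tags(raw)
    for rule in rules:
        if rule.prefix and text.startswith(rule.prefix):
            text = text[len(rule.prefix):]
        if rule.suffix and text.endswith(rule.suffix):
            text = text[: len(text) - len(rule.suffix)]
    return text.strip()


def learn_rule(raw: str, verified: str,
               rules: List[CleanupRule]) -> Optional[CleanupRule]:
    """Compare cleanup output with the verified text (assumed learning step).

    Returns a new rule when the verified text appears inside the cleanup
    output with leftover text around it; returns None when the output is
    already correct or the verified text cannot be located.
    """
    cleaned = apply_rules(raw, rules)
    if cleaned == verified:
        return None
    start = cleaned.find(verified)
    if start < 0:
        return None
    end = start + len(verified)
    return CleanupRule(prefix=cleaned[:start], suffix=cleaned[end:])


# Usage: learn a rule from one verified record, then reuse it on another.
rules: List[CleanupRule] = []
sample = "<p><b>Abstract:</b> We propose a method. Copyright 2005</p>"
new_rule = learn_rule(sample, "We propose a method.", rules)
if new_rule is not None:
    rules.append(new_rule)
print(apply_rules("<p><b>Abstract:</b> Another study. Copyright 2005</p>", rules))
```

Literal prefix and suffix matching is chosen here only because it is easy to follow; a production system would likely need per-journal rule sets and more tolerant matching.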

Similar resources

Automated Labeling Of Biomedical Online Journal Articles

An automated labeling (AL) module has been developed to automate the extraction of bibliographic data (e.g., article title, authors, affiliation, abstract, and others) from online biomedical journals for the National Library of Medicine’s MEDLINE database. The AL module employs string matching, statistics, and fuzzy rule-based algorithms to identify segmented zones in an article’s HTML pages a...

Automated labeling of bibliographic data extracted from biomedical online journals

A prototype system has been designed to automate the extraction of bibliographic data (e.g., article title, authors, abstract, affiliation and others) from online biomedical journals to populate the National Library of Medicine’s MEDLINE® database. This paper describes a key module in this system: the labeling module that employs statistics and fuzzy rule-based algorithms to identify segmented ...

Automated Labeling from Biomedical Journals published in Foreign Languages

An automated labeling (AL) module is developed to produce bibliographic records such as English title, vernacular title, author, affiliation, and English abstract from biomedical articles published in foreign language journals. Optical character recognition (OCR) output from scanned biomedical journals is used in this labeling process. Since frequently occurring words in a zone are important fe...

Automated Labeling Algorithms for Biomedical Document Images

The National Library of Medicine (NLM) has developed an automated system, named Medical Article Records System (MARS), to process bibliographic data (title, authors, affiliation, abstract, etc.) in biomedical journal articles for its MEDLINE database. This paper describes a labeling module in MARS, which automatically extracts the bibliographic data in biomedical journal articles. The label...

Automated data entry system: performance issues

This paper discusses the performance of a system for extracting bibliographic fields from scanned pages in biomedical journals to populate MEDLINE®, the flagship database of the National Library of Medicine (NLM), and heavily used worldwide. This system consists of automated processes to extract the article title, author names, affiliations and abstract, and manual workstations for the entry of...

Publication date: 2005